Approaches to Black Box MT Evaluation
Abstract
In the course of four evaluations in the Advanced Research Projects Agency machine translation series, evaluation methods have evolved for measuring the core components of a diverse set of systems. This paper describes the methodologies in terms of the most recent evaluation of research and production MT systems, and discusses indications of ways to improve the focus and portability of the evaluation.

0. Introduction. Over the past four years, a set of evaluation methodologies has evolved within the MT initiative of the U.S. Advanced Research Projects Agency (ARPA). The ARPA program has faced unique challenges for evaluation, because the participating systems differ radically in linguistic approach, level of maturity, and the languages translated. The differences among these systems have made a black-box orientation to evaluation inevitable. While such an orientation differs from the methods that might be employed in the evaluation of a particular system by that system's developers, the black-box approaches nevertheless offer certain advantages in determining the focus and metrics of evaluation. This paper describes the methodologies of the ARPA program in terms of their objectives, results, and evolution, and discusses analyses of the methods themselves intended to continue improving the process.

1.0. Background. The ARPA initiative in machine translation began in 1991 as part of the Human Language Technologies Program. Three projects in MT research were sponsored under the initiative, with voluntary participation by several commercial and institutional MT organizations. The sponsored projects were: Candide (IBM Watson Research Center), a statistical modeling approach, translating French to English (Brown et al., 1993); Pangloss (Center for Machine Translation, Computing Research Laboratory, and Information Sciences Institute), using knowledge-based approaches, translating Spanish and Japanese to English (Frederking et al., 1993); and Lingstat (Dragon Systems, Inc.), using a combination of modeling and rule-based approaches, translating Japanese and Spanish to English (Yamron et al., 1994). Many organizations have provided production MT systems for these evaluations, principally to assist ARPA's mission by helping to determine industry/discipline benchmarks. In return, a significant goal of the ARPA MT evaluation program is to provide a useful set of evaluation processes for a general standard. In the most recent test-evaluation cycle of August–November 1994 (the "3Q94" evaluation), the following production systems participated:

• the Sietec METAL system (French-English);
• the Nippon Electric PIVOT system (Japanese-English);
• Globalink Power Translator (French and ...
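The abstract describes the black-box orientation only at a high level; the ARPA evaluations relied on human judges scoring system output rather than on automatic metrics. As a minimal sketch of the black-box pattern itself — each system treated as an opaque translation function, run over a shared test set, with per-system judgments averaged — the following Python is illustrative only: the harness names, the judge_segment callback, and the 5-point adequacy/fluency scales are assumptions, not the ARPA tooling.

```python
"""Minimal sketch of a black-box MT evaluation harness (illustrative,
not the ARPA methodology itself)."""
from statistics import mean
from typing import Callable, Dict, List

def evaluate_black_box(
    systems: Dict[str, Callable[[str], str]],  # name -> opaque translate() function
    source_segments: List[str],                # shared test set for all systems
    judge_segment: Callable[[str, str], Dict[str, int]],
    # judge_segment(source, output) -> {"adequacy": 1-5, "fluency": 1-5}
    # (hypothetical human-judgment callback; scales assumed)
) -> Dict[str, Dict[str, float]]:
    """Run every system on the same segments and average the judgments."""
    results = {}
    for name, translate in systems.items():
        judgments = [judge_segment(src, translate(src)) for src in source_segments]
        results[name] = {
            "adequacy": mean(j["adequacy"] for j in judgments),
            "fluency": mean(j["fluency"] for j in judgments),
        }
    return results
```

The point of the pattern is that nothing inside translate() is inspected, which is what lets systems of radically different design be compared on equal terms.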
منابع مشابه
Can Automatic Post-Editing Make MT More Meaningful?
Automatic post-editors (APEs) enable the re-use of black box machine translation (MT) systems for a variety of tasks where different aspects of translation are important. In this paper, we describe APEs that target adequacy errors, a critical problem for tasks such as cross-lingual question-answering, and compare different approaches for post-editing: a rule-based system and a feedback approach...
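To make the wrapper idea concrete: an APE sits between a fixed black-box MT system and the downstream task, rewriting the raw output without touching the system itself. A minimal sketch of the rule-based variant, assuming hypothetical substitution rules (the paper's actual rules and its feedback approach are not shown here), might look like this:

```python
"""Sketch of a rule-based automatic post-editor wrapping black-box MT output."""
import re
from typing import Callable, List, Tuple

# Hypothetical substitution rules: (compiled pattern, replacement) pairs
# targeting known adequacy errors in a given system's output.
Rule = Tuple[re.Pattern, str]

def make_rule_based_ape(rules: List[Rule]) -> Callable[[str], str]:
    """Return a post-editor that applies each rule in order to MT output."""
    def post_edit(mt_output: str) -> str:
        for pattern, replacement in rules:
            mt_output = pattern.sub(replacement, mt_output)
        return mt_output
    return post_edit

# Usage: re-use an unmodified MT system for a new task.
rules = [(re.compile(r"\bactual\b"), "current")]  # e.g. a false-friend mistranslation
ape = make_rule_based_ape(rules)
print(ape("the actual president of the company"))  # -> "the current president ..."
```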
Part 5: Machine Translation Evaluation, Chapter 5.1: Introduction
The evaluation of machine translation (MT) systems is a vital field of research, both for determining the effectiveness of existing MT systems and for optimizing their performance. This part describes a range of evaluation approaches used in the GALE community and introduces the evaluation protocols and methodologies used in the program. We discuss the development and use of a...
Black-Box/Glass-Box Evaluation in Shiraz
The Shiraz project included an evaluation component: two ‘glass-box’ evaluations have been performed during the project as well as a black-box evaluation at the end of the project. The evaluations were based on the use of a bilingual tagged test corpus of 3000 sentences. Evaluation tools were developed in order to automate the evaluation process. The glass-box evaluations included the evaluatio...
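The abstract does not say what the Shiraz evaluation tools actually computed, but an automated black-box pass over a bilingual test corpus can be sketched as follows; the token-overlap score and the corpus format here are stand-in assumptions, not the project's metric:

```python
"""Sketch of an automated black-box pass over a bilingual test corpus
(scoring and corpus format are assumptions for illustration)."""
from typing import Callable, Iterable, Tuple

def token_overlap(candidate: str, reference: str) -> float:
    """Crude score: fraction of reference token types found in the candidate."""
    cand = candidate.lower().split()
    ref_types = set(reference.lower().split())
    if not ref_types:
        return 0.0
    return sum(1 for tok in ref_types if tok in cand) / len(ref_types)

def run_corpus_eval(
    translate: Callable[[str], str],            # the system under test, opaque
    corpus: Iterable[Tuple[str, str]],          # (source sentence, reference translation)
) -> float:
    """Average the per-sentence scores over the whole test corpus."""
    scores = [token_overlap(translate(src), ref) for src, ref in corpus]
    return sum(scores) / len(scores)
```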
Adaptation of the DARPA Machine Translation Evaluation Paradigm to End-to-End Systems
The Defense Advanced Research Projects Agency (DARPA) Machine Translation (MT) Initiative spanned four years. One outcome of this effort was a methodology for evaluating the core technology of MT systems which differ widely in approach, maturity, platform, and language combination. This black box methodology, which proved capable of measuring performance of such diverse systems, used methods wh...
Probing the Lexicon in Evaluating Commercial MT Systems
In the past, the evaluation of machine translation systems focused on single-system evaluations because only a few systems were available. Now, however, several commercial systems exist for the same language pair, which calls for new methods of comparative evaluation. In this paper we propose a black-box method for comparing the lexical coverage of MT systems. The method is based on lists of ...
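The probe idea can be sketched directly: send each word from a list through every system and apply a pass-through heuristic, where output identical to the input suggests a missing lexicon entry. Both the heuristic and the names below are assumptions for illustration, not the authors' method:

```python
"""Sketch of a word-list probe for comparing MT lexical coverage
(pass-through heuristic and names are assumptions)."""
from typing import Callable, Dict, List

def lexical_coverage(
    systems: Dict[str, Callable[[str], str]],  # name -> opaque translate() function
    word_list: List[str],                       # probe words in the source language
) -> Dict[str, float]:
    """Estimate the fraction of probe words each system has in its lexicon."""
    coverage = {}
    for name, translate in systems.items():
        covered = sum(
            1 for word in word_list
            # A source word returned unchanged (modulo case) is assumed
            # to have been passed through for lack of a lexicon entry.
            if translate(word).strip().lower() != word.lower()
        )
        coverage[name] = covered / len(word_list)
    return coverage
```

The heuristic over-counts coverage for words that legitimately translate to themselves (names, cognates), so real word lists would need to exclude such items.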